Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context from only a few examples. Nevertheless, the mechanisms by which Transformers become in-context learners are not well understood and remain largely a matter of intuition. Here, we argue that training Transformers on auto-regressive tasks is closely related to well-known gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of the data transformations induced by 1) a single linear self-attention layer and by 2) gradient descent (GD) on a regression loss. Motivated by this construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and the trained Transformers show great similarity or, remarkably, the weights found by optimization match the construction. We thus show how trained Transformers implement gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of optimized Transformers that learn in-context. Furthermore, we identify how Transformers surpass plain gradient descent by an iterative curvature correction, and how they learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed the induction head (Olsson et al., 2022), and show how it could be understood as a specific case of in-context learning by gradient descent within Transformers.
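To make the construction concrete, here is a minimal numerical sketch (in Python with NumPy, which the abstract itself does not use; all names, dimensions, and the learning rate are illustrative) of the scalar-target case: starting from zero weights, one GD step on the regression loss yields exactly the prediction of a linear self-attention layer that uses the context inputs as keys and the targets as values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 4                      # context examples, input dimension
X = rng.normal(size=(N, d))       # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                    # in-context targets y_i
x_q = rng.normal(size=d)          # query input
eta = 0.1                         # learning rate

# One step of GD on L(w) = 1/(2N) * sum_i (w.x_i - y_i)^2, from w0 = 0
w0 = np.zeros(d)
grad = (X.T @ (X @ w0 - y)) / N
w1 = w0 - eta * grad
pred_gd = w1 @ x_q

# The same prediction via linear self-attention (no softmax): with
# identity query/key projections and the targets as values, the query's
# attention output is eta/N * sum_i y_i <x_i, x_q>
attn_scores = X @ x_q             # <x_i, x_q> for each context token
pred_attn = eta / N * (y @ attn_scores)

print(pred_gd, pred_attn)
assert np.allclose(pred_gd, pred_attn)   # the two predictions coincide
```

With zero initialization the GD update is w1 = eta/N * X.T @ y, so the GD prediction w1 . x_q expands to exactly the linear-attention sum; this is the simplest instance of the equivalence the abstract describes.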
We study the problem of example-based procedural texture synthesis using highly compact models. Given a sample image, we use differentiable programming to train a generative process parameterised by a recurrent Neural Cellular Automata (NCA) rule. Contrary to the common belief that neural networks should be significantly over-parameterised, we demonstrate that our model architecture and training procedure allow complex texture patterns to be represented with only a few hundred learned parameters, making their expressivity comparable to hand-engineered procedural texture-generating programs. The smallest models from the proposed $\mu$NCA family scale down to 68 parameters. When each parameter is quantised to one byte, the proposed models can be shrunk to sizes between 588 and 68 bytes. A texture generator that uses these parameters to produce images can be implemented in just a few lines of GLSL or C code.
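As a rough illustration of the recurrent NCA rule the abstract describes, the following is a minimal, untrained sketch in Python/NumPy (the paper itself targets GLSL or C for deployment). The channel count, the Laplacian perception filter, and the single dense update rule are assumptions; the random weights stand in for parameters that would be learned by differentiable training against a texture loss.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 64, 64               # channels, grid height/width (illustrative)

# Hypothetical learned parameters: one dense rule mapping the perception
# vector (cell state + per-channel Laplacian) to a state update.
W_rule = rng.normal(size=(2 * C, C)) * 0.1
b_rule = np.zeros(C)

def laplacian(s):
    # 3x3 Laplacian of each channel with wrap-around (toroidal) boundaries
    return (np.roll(s, 1, 1) + np.roll(s, -1, 1) +
            np.roll(s, 1, 2) + np.roll(s, -1, 2) - 4 * s)

def step(state):
    # Perception: concatenate the raw state with its per-channel Laplacian
    percept = np.concatenate([state, laplacian(state)], axis=0)  # (2C, H, W)
    # Apply the same rule at every cell and update the state residually
    update = np.einsum('pc,phw->chw', W_rule, percept) + b_rule[:, None, None]
    return state + 0.1 * np.tanh(update)

state = rng.normal(size=(C, H, W)) * 0.1  # random initial grid
for _ in range(100):                       # iterate the recurrent rule
    state = step(state)
rgb = np.clip(state[:3], 0.0, 1.0)         # first three channels as the image
```

With these shapes the rule has 2C*C + C = 36 parameters, in the same ballpark as the 68-parameter models the abstract reports, which makes the byte-level size figures plausible once each parameter is quantised to one byte.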
This paper is devoted to the problem of detecting forest and non-forest areas in Earth imagery. We propose two statistical approaches to this problem: one based on multiple hypothesis testing with parametric distribution families, the other on non-parametric tests. The parametric approach is novel in the literature and is relevant to a wider class of problems, namely the detection of natural objects as well as anomaly detection. We develop the mathematical background for each of the two approaches, use them to build self-sufficient detection algorithms, and discuss the numerical aspects of their implementation. We also compare our algorithms with standard machine-learning algorithms using satellite data.
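The abstract does not specify the parametric family or the test statistics used, so the following Python/SciPy snippet is only a hedged sketch of the two routes it contrasts: a parametric decision via Gaussian likelihoods fit to labeled samples, and a non-parametric decision via a two-sample Kolmogorov-Smirnov test against a forest reference sample. All distributions, band values, and thresholds are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical training samples of one spectral band (illustrative numbers):
forest_train = rng.normal(0.2, 0.05, size=1000)
nonforest_train = rng.normal(0.5, 0.10, size=1000)

# Parametric route: fit a Gaussian family to each class and decide a test
# patch by comparing average log-likelihoods under the two fits.
f_mu, f_sigma = forest_train.mean(), forest_train.std()
n_mu, n_sigma = nonforest_train.mean(), nonforest_train.std()

def classify_parametric(patch):
    ll_forest = stats.norm.logpdf(patch, f_mu, f_sigma).mean()
    ll_nonforest = stats.norm.logpdf(patch, n_mu, n_sigma).mean()
    return 'forest' if ll_forest > ll_nonforest else 'non-forest'

# Non-parametric route: two-sample Kolmogorov-Smirnov test of the patch
# against the forest reference; reject "forest" when the empirical
# distributions differ significantly at level alpha.
def classify_nonparametric(patch, alpha=0.05):
    _, p_value = stats.ks_2samp(patch, forest_train)
    return 'forest' if p_value > alpha else 'non-forest'

patch = rng.normal(0.48, 0.1, size=200)   # an unlabeled test patch
print(classify_parametric(patch), classify_nonparametric(patch))
```

The contrast the sketch is meant to surface: the parametric route commits to a distribution family and gains power when that family is right, while the non-parametric test makes no such commitment at the cost of needing more samples per decision.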